Backgroud

While locations of 2020, 2024, and 2028 Summer Olympics are decided, (Tokyo, Paris, and Los Angeles in order), 2032’s location is not decided and its bidding process has not begun yet. Delhi is very serious in entering the bid, and in preparation, its committee wants to build a model that can predict weather to assess its feasibility.

Weather, Climate and Prediction for the Olympic Games

Baseline climatological information includes temperature, relative humidity, wind speed and direction, precipitation, and visibility. Several additional climatological elements for the summer games can include lightning, heat index, UV index, sea breeze, and ocean currents. While most climatological data is available from existing observation stations usually. These are guidelines from World Meteorological Organization. https://www.wmo.int/pages/prog/amp/pwsp/documents/PWS_Olympic_Guideline_Draft-Ver3.pdf

Objective

Although its specific timing is unknown, as Summer Olympics took places between April and October in the past, being able to predict temperature and humidity of the city of New Delhi in such time frame will help plan the bid.

Dataset

There is a dataset available for past 20+ years weather on hourly basis for New Delhi, India. https://www.kaggle.com/mahirkukreja/delhi-weather-data/downloads/delhi-weather-data.zip/2

Data Exploration and Preparation

Data Exploration

About the dataset: 100990 rows(observations) and 20 columns(variables)

## [1] 100990
## [1] 20
##  [1] "datetime_utc" "X_conds"      "X_dewptm"     "X_fog"       
##  [5] "X_hail"       "X_heatindexm" "X_hum"        "X_precipm"   
##  [9] "X_pressurem"  "X_rain"       "X_snow"       "X_tempm"     
## [13] "X_thunder"    "X_tornado"    "X_vism"       "X_wdird"     
## [17] "X_wdire"      "X_wgustm"     "X_windchillm" "X_wspdm"

Missing values

##  datetime_utc         X_conds             X_dewptm          X_fog        
##  Length:100990      Length:100990      Min.   :-24.00   Min.   :0.00000  
##  Class :character   Class :character   1st Qu.: 10.00   1st Qu.:0.00000  
##  Mode  :character   Mode  :character   Median : 15.00   Median :0.00000  
##                                        Mean   : 15.74   Mean   :0.06969  
##                                        3rd Qu.: 22.00   3rd Qu.:0.00000  
##                                        Max.   : 75.00   Max.   :1.00000  
##                                        NA's   :621                       
##      X_hail           X_heatindexm      X_hum           X_precipm     
##  Min.   :0.0000000   Min.   :26.80   Length:100990      Mode:logical  
##  1st Qu.:0.0000000   1st Qu.:31.70   Class :character   NA's:100990   
##  Median :0.0000000   Median :35.10   Mode  :character                 
##  Mean   :0.0001287   Mean   :35.65                                    
##  3rd Qu.:0.0000000   3rd Qu.:39.20                                    
##  Max.   :1.0000000   Max.   :73.60                                    
##                      NA's   :71835                                    
##   X_pressurem            X_rain            X_snow           X_tempm     
##  Min.   :    -9999   Min.   :0.00000   Min.   :0.0e+00   Min.   : 1.00  
##  1st Qu.:     1002   1st Qu.:0.00000   1st Qu.:0.0e+00   1st Qu.:19.00  
##  Median :     1008   Median :0.00000   Median :0.0e+00   Median :27.00  
##  Mean   :     1932   Mean   :0.02626   Mean   :9.9e-06   Mean   :25.45  
##  3rd Qu.:     1014   3rd Qu.:0.00000   3rd Qu.:0.0e+00   3rd Qu.:32.00  
##  Max.   :101061443   Max.   :1.00000   Max.   :1.0e+00   Max.   :90.00  
##  NA's   :232                                             NA's   :673    
##    X_thunder          X_tornado            X_vism            X_wdird     
##  Min.   :0.000000   Min.   :0.00e+00   Min.   :   0.000   Min.   :  0.0  
##  1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:   1.500   1st Qu.: 50.0  
##  Median :0.000000   Median :0.00e+00   Median :   2.000   Median :160.0  
##  Mean   :0.009427   Mean   :1.98e-05   Mean   :   2.403   Mean   :163.6  
##  3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:   3.000   3rd Qu.:270.0  
##  Max.   :1.000000   Max.   :1.00e+00   Max.   :6436.000   Max.   :960.0  
##                                        NA's   :4428       NA's   :14755  
##    X_wdire             X_wgustm      X_windchillm       X_wspdm        
##  Length:100990      Min.   :25.90   Min.   :2.10     Min.   :   0.000  
##  Class :character   1st Qu.:33.30   1st Qu.:4.90     1st Qu.:   0.000  
##  Mode  :character   Median :37.00   Median :6.10     Median :   7.400  
##                     Mean   :37.67   Mean   :5.71     Mean   :   7.678  
##                     3rd Qu.:40.70   3rd Qu.:6.80     3rd Qu.:  11.100  
##                     Max.   :92.60   Max.   :7.30     Max.   :1514.900  
##                     NA's   :99918   NA's   :100411   NA's   :2358

Missing Values

Column Name Number of NA’s % of NA’s
X_dewptm 621 .61%
X_heatindexm 71835 71%
X_precipm 100990 100%
X_pressurem 232 .23%
X_tempm 673 .66%
X_vism 4428 4.4%
X_wdird 14755 14.6%
X_wgustm 99918 98.9%
X_windchillm 100411 99.4%
X_wspdm 2358 2.33%

Deal with some maximum numbers in variable X_tempm

While temperature of the city (X_tempm) is #1 point of interest in this exercise, from summary of the dataset, we found maximum number is 90 in the variable X_tempm. We need to diagose if this number is correct or not?

##         datetime_utc        X_conds X_dewptm X_fog X_hail X_heatindexm
## 91392 20150624-03:00 Patches of Fog       25     1      0           NA
##       X_hum X_precipm X_pressurem X_rain X_snow X_tempm X_thunder
## 91392     4        NA         997      0      0      90         0
##       X_tornado X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 91392         0    0.5      NA               NA           NA       0

In the temperature column, the temperature numbers were in Celsius. If we think 90 is in Celsius, this is not possible for a temperature in New Delhi. We checked other time on Jun 24 to see how tempeature look like

##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 91393 20150624-06:00   Smoke       26     0      0           NA    56
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 91393        NA         997      0      0      34         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 91393      1      50      NE       NA           NA     9.3

After we checked the temperature on Jun 24 at 6:00am was 34 in this dataset, we decided that it was reasonable to assume that 90 was in Fahrenheit. Therefore, we changed it to Celsius which is 32 Celsius.

##         datetime_utc        X_conds X_dewptm X_fog X_hail X_heatindexm
## 91392 20150624-03:00 Patches of Fog       25     1      0           NA
##       X_hum X_precipm X_pressurem X_rain X_snow X_tempm X_thunder
## 91392     4        NA         997      0      0      32         0
##       X_tornado X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 91392         0    0.5      NA               NA           NA       0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   19.00   27.00   25.45   32.00   72.00     673
##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 11085 19981125-01:30   Smoke        6     0      0           NA     4
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 11085        NA        1015      0      0      72         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 11085    2.8     260    West       NA           NA     3.7

We assume that the 72 was also in Fahrenheit. We will convert it to Celsius which will be 22.

##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 11085 19981125-01:30   Smoke        6     0      0           NA     4
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 11085        NA        1015      0      0      22         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 11085    2.8     260    West       NA           NA     3.7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   19.00   27.00   25.45   32.00   63.00     673
##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 12952 19990529-11:30    Haze        7     0      0           NA     4
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 12952        NA         995      0      0      63         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 12952    4.5     300     WNW       NA           NA    18.5

The 63 could be a typo as temperature in May in New Delhi usually around 40 / 20(high/low). So we check other time on May 29 to see what the temperature recorded in the dataset. We found the temperature at 12:30 is 42, so we will replace the 63 by 42.

##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 12953 19990529-12:30    Haze        7     0      0           NA    12
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 12953        NA         994      0      0      42         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 12953      4     270    West       NA           NA    14.8
##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 12952 19990529-11:30    Haze        7     0      0           NA     4
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 12952        NA         995      0      0      42         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 12952    4.5     300     WNW       NA           NA    18.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   19.00   27.00   25.45   32.00   62.00     673
##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 68078 20061231-06:00   Smoke        9     0      0           NA     4
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 68078        NA        1018      0      0      62         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 68078      1     180   South       NA           NA     7.4
##         datetime_utc     X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 68077 20061231-03:00 Partial Fog        8     1      0           NA    92
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 68077        NA        1017      0      0       9         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 68077    0.2      NA               NA           NA       0
##         datetime_utc X_conds X_dewptm X_fog X_hail X_heatindexm X_hum
## 68079 20061231-09:00   Smoke       14     0      0           NA    59
##       X_precipm X_pressurem X_rain X_snow X_tempm X_thunder X_tornado
## 68079        NA        1014      0      0      21         0         0
##       X_vism X_wdird X_wdire X_wgustm X_windchillm X_wspdm
## 68079      1     230      SW       NA           NA     1.9

After we checked one tempeature before and after 6:00AM, we belive the 62 was in Fahrenheit, therefore, we will change it to 17 in Celsius.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   19.00   27.00   25.45   32.00   47.00     673

We checked tempeature history of New Delhi online and found that these days show above had high tempearture around 47. Since there were more than 10 observations, we will assume that it is reasonale and proceed without further value adjustments.

Deal with NA’s in the variable X_tempm

Replaceing Missing data: mean imputation method

## [1] 25.44953
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   27.00   25.45   32.00   47.00

There is no more NA’s.

Data Preparation - Subset dataset

Remove variable X_heatindexm, X_precipm, X_wgustm, X_windchillm due to the high percentage of NA’s in these variables

Subset dataset with only three variables:datetime_utc, X_hum and X_tempm.

## [1] "integer"
##  datetime_utc           X_hum           X_tempm     
##  Length:100990      Min.   :  4.00   Min.   : 1.00  
##  Class :character   1st Qu.: 39.00   1st Qu.:19.00  
##  Mode  :character   Median : 59.00   Median :27.00  
##                     Mean   : 57.91   Mean   :25.45  
##                     3rd Qu.: 78.00   3rd Qu.:32.00  
##                     Max.   :243.00   Max.   :47.00  
##                     NA's   :757

When we convert the character to integer, NA’s introduced by coercion. Therefore, we will need to deal NA’s in variable X_hum. We also need to check if the maximum number 243 is reasonable number?

Check the maximum number in X_hum

##         datetime_utc X_hum X_tempm
## 95249 20160607-03:00   243      32
##         datetime_utc X_hum X_tempm
## 95248 20160607-02:00    66      31
##         datetime_utc X_hum X_tempm
## 95250 20160607-04:00    56      34

We also checked online at link: [https://www.wunderground.com/history/daily/in/new-delhi/VIDP/date/2016-6-7]

So the 243 is an error number. We will replace it with the number 55 which is from online.

Summary the X_hum to see if all numbers make sense.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4.00   39.00   59.00   57.91   78.00  225.00     757
##         datetime_utc X_hum X_tempm
## 61571 20040929-15:00   225      27
##         datetime_utc X_hum X_tempm
## 61570 20040929-12:00    42      31
##         datetime_utc X_hum X_tempm
## 61572 20040929-18:00    77      26

We also checked online at link: [https://www.wunderground.com/history/daily/in/new-delhi/VIDP/date/2004-9-29]

So the 225 is an error number. We will replace it with the number 43 which is from online.

Summary the X_hum to see if all numbers make sense

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4.00   39.00   59.00   57.91   78.00  135.00     757
##         datetime_utc X_hum X_tempm
## 73297 20081027-21:00   135      20
##         datetime_utc X_hum X_tempm
## 73296 20081027-18:00    55      21

There is no one after 21:00 on that day. We checked online at link: [https://www.wunderground.com/history/daily/in/new-delhi/VIDP/date/2008-10-27] and found the humidity number is 57 for 20081027-21:00. So, we will replace the 135 with 57.

Summary the X_hum to see if all numbers make sense

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4.00   39.00   59.00   57.91   78.00  100.00     757

Now we see the maximum humidity number is 100. Is it possible? Surprisingly, yes, the condition is known as supersaturation. At any given temperature and air pressure, a specific maximum amount of water vapor in the air will produce a relative humidity (RH) of 100 percent

Deal with NA’s in variable X_hum: Mean Imputation Method

## [1] 57.90501
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   39.00   58.00   57.91   77.00  100.00

Deal with variable datetime_utc

## [1] "character"
## [1] "Date"
## [1] "1996-11-01" "1996-11-01" "1996-11-01" "1996-11-01" "1996-11-01"
## [6] "1996-11-01"
## [1] "2017-04-24" "2017-04-24" "2017-04-24" "2017-04-24" "2017-04-24"
## [6] "2017-04-24"

Build up model

Plots for temperature

As time series analysis requires a series of equally disperse time points, the dataset is aggregated by date by mean, in order to have one observation per day. Then use ggplot to visualize the daily temperature data

The new data set is plotted for review. The dataset seems pretty stationary with seasonality.

Decomposing the daily temperature time series

Now, the dataset will be decomposed for further analysis. Not a significant trend exits but seasonality is definitely there.

Checking whether the dataset is stationary

Now the dataset’s stationarity is being check to see whether this is suitable for ARIMA model. And the test results indicate that this is stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  tstemp
## Dickey-Fuller = -5.013, Lag order = 19, p-value = 0.01
## alternative hypothesis: stationary

Building an ARIMA model

While running ACF and PACF to help find p, d, q value, it is noticed that there are too many observations in the dataset.

For now, an ARIMA model is being developed with auto. arima function with existing dateset and its result is being reviewed

## Series: tstemp 
## ARIMA(2,0,1)(0,1,0)[365] 
## 
## Coefficients:
##          ar1      ar2      ma1
##       1.6188  -0.6255  -0.9008
## s.e.  0.0171   0.0162   0.0113
## 
## sigma^2 estimated as 5.459:  log likelihood=-15813.33
## AIC=31634.67   AICc=31634.67   BIC=31662.07

Forecast for next 3000 days

Aggregate the dataset to the monthly temperature time series

Although previous model(daily) produces a reasonable model, it seems to have some readability issue in test results, in addition to computation time. We are aggregating the dataset to monthly mean to develop another model.

We notice that curves are smoother and the dataset is still more or less stationary.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  tstemp2
## Dickey-Fuller = -16.673, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

For ACF and PACF tests, they are conceptually the same as daily dataset’s but gotten easier to read.

## Series: tstemp2 
## ARIMA(3,0,3)(0,1,1)[12] with drift 
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2     ma3     sma1   drift
##       0.6128  0.2931  -0.3083  -0.2902  -0.2260  0.0685  -0.6233  0.0059
## s.e.  0.3151  0.3544   0.2534   0.3212   0.3085  0.2208   0.0816  0.0059
## 
## sigma^2 estimated as 3.561:  log likelihood=-477.55
## AIC=973.09   AICc=973.9   BIC=1004.15

Since there are lags over level of significance still being noticed, an ARIMA model with manual order will be tried to improve accuracy.

Monthly temperature: Manual ARIMA

## 
## Call:
## arima(x = tstemp2, order = c(10, 0, 10), seasonal = list(order = c(0, 1, 1)))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ar5      ar6     ar7      ar8
##       0.6505  -0.0683  0.2018  -0.3807  0.1023  -0.0680  0.3535  -0.5979
## s.e.  0.8356   0.8701  0.4213   0.3477  0.4130   0.2597  0.2062   0.3860
##          ar9    ar10      ma1     ma2      ma3     ma4      ma5     ma6
##       0.3133  0.3496  -0.3211  0.1346  -0.2757  0.3811  -0.0098  0.2814
## s.e.  0.6835  0.6089   0.8498  0.5892   0.3464  0.3504   0.4066  0.1962
##           ma7     ma8      ma9     ma10     sma1
##       -0.4703  0.4528  -0.4411  -0.2669  -0.8459
## s.e.   0.3184  0.4971   0.5960   0.6385   0.0732
## 
## sigma^2 estimated as 2.592:  log likelihood = -453.99,  aic = 951.98

Since the model seems to have better accuracy, now the model is being tested for its forecast accuracy with actual value. For this, first 200 values are being used to train the model and remaining actual 45 values are being compared to forecast.

Build model for variable humidity

Although it is more or less redundant, as humidity also plays a big role in Summer weather condition, another ARIMA model is being built for a humidity prediction model.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  tshum
## Dickey-Fuller = -8.0112, Lag order = 19, p-value = 0.01
## alternative hypothesis: stationary

## 
##  Augmented Dickey-Fuller Test
## 
## data:  tshum2
## Dickey-Fuller = -8.0591, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

## Series: tshum2 
## ARIMA(3,0,2)(0,1,1)[12] 
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2     sma1
##       0.5360  0.5231  -0.2334  -0.1125  -0.5243  -0.7908
## s.e.  0.5621  0.4317   0.1965   0.5531   0.3808   0.0681
## 
## sigma^2 estimated as 76.76:  log likelihood=-839.22
## AIC=1692.44   AICc=1692.94   BIC=1716.6

## 
## Call:
## arima(x = tshum2, order = c(11, 0, 11), seasonal = list(order = c(0, 1, 1)))
## 
## Coefficients:
##           ar1      ar2      ar3     ar4      ar5     ar6     ar7     ar8
##       -0.2813  -0.3619  -0.0340  0.1810  -0.0844  0.3095  0.3501  0.3623
## s.e.      NaN   0.5051   0.5055  0.4067   0.4509  0.4025     NaN  0.1806
##          ar9     ar10     ar11     ma1     ma2     ma3     ma4     ma5
##       0.1164  -0.0378  -0.2439  0.7277  0.7589  0.3836  0.1574  0.2916
## s.e.     NaN   0.2477   0.1287     NaN  0.4767  0.6668  0.8873  0.8566
##           ma6      ma7      ma8     ma9     ma10    ma11     sma1
##       -0.1127  -0.3389  -0.6567  -0.567  -0.4270  0.2227  -0.7122
## s.e.   0.8761   0.5208   0.3486     NaN   0.2873     NaN   0.0944
## 
## sigma^2 estimated as 60.17:  log likelihood = -819.34,  aic = 1686.67

Deployment Discussion

  • Due to time contraint and lack of understanding in all of time series model algorithms, this project has only demonstrated the usage of ARIMA model. However, both Auto ARIMA and Manual ARIMA are tested to determine a model with better accuracy, ACF and PACF results are examined to utilized manual ARIMA fitting.

  • Weather data from 1996 to 2017 is used to build the model, however, 2032 Summer Olympics is 13 years away from now which is very distant from now. It is realized that the current model might not stay completely feasible for 2032 prediction and will require further data and model re-fitting in future. Although there is not a significant trend in the data now, it may develop a trend due to global climate change. In addition, it is possible that it can develop a strong correlation among different variables.

  • However, it is believed that there still is reasonable value in the built model. From Summer Olympics precedents, climate from past events can be compared to different months of the city, then when reasonable target months are decided, that information can further develop other economical and political case studies to test the hosting city’s feasibility.

  • ShinyAPP for this model is developed to generate predicted termperature and humidity value of future months. Maximum 240 months(20 years) prediction is possible, which covers up to year 2037, while 2032 is main focus. The predicted value can be seen in both table and graph format. In addition, it also enables users to review past temperature and humidity record by year and month so they can intuitively get understanding in the city’s climate. While all 22 years’ record is available, a slider control is placed to shorten the range if certain period needs to be focused.

Project Codes

All project codes and dataset are on github shinyapp